
[Bugfix] Fix guided decoding with tokenizer mode mistral #11046

Merged

Conversation

@wallashss (Contributor) commented Dec 10, 2024

This PR addresses the issues reported in #11045.

Changelog:

  • Fix the Mistral tokenizer when its decode method is overwritten by the outlines processors: the adapter replaces this method, so when the tokenizer calls it, it gets a wrong response (see the illustrative sketch below).
  • Added support for the Mistral tokenizer with XGrammar.
  • Added tests to cover the Mistral tokenizer mode.
  • Minor refactor of the XGrammar decoding.
  • Fixed the tests in test_guided_processors, which were not being executed, and added them to CI to validate the changes.

FIX #11045
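
To illustrate the failure mode behind the first changelog item, here is a generic, self-contained sketch (this is not vLLM's or outlines' actual adapter code; the class and function names are hypothetical). It shows why overriding decode on a shared tokenizer instance changes the behavior seen by every other caller, and why adapting a copy avoids that:

```python
import copy


class DummyTokenizer:
    """Hypothetical stand-in for a tokenizer whose decode() others rely on."""

    def decode(self, token_ids: list[int]) -> str:
        return " ".join(f"<tok{i}>" for i in token_ids)


def adapt_for_guided_decoding(tokenizer: DummyTokenizer) -> DummyTokenizer:
    """Outlines-style adaptation: replace decode() with a different contract."""
    original_decode = tokenizer.decode

    def per_token_decode(token_ids: list[int]) -> list[str]:
        # Note the changed return type: a list of strings, not one string.
        return [original_decode([tid]) for tid in token_ids]

    # If `tokenizer` is the shared instance, every other component calling
    # tokenizer.decode() now gets this wrapped behavior, i.e. the "wrong
    # response" described in the changelog.
    tokenizer.decode = per_token_decode
    return tokenizer


shared = DummyTokenizer()
# Safer: adapt a copy so the shared tokenizer keeps its original decode().
adapted = adapt_for_guided_decoding(copy.copy(shared))

assert isinstance(shared.decode([1, 2]), str)    # unchanged
assert isinstance(adapted.decode([1, 2]), list)  # wrapped behavior
```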


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Dec 10, 2024
@mgoin (Member) left a comment


Thanks for working on this and trying to clean up along the way. My main concern is with using the tokenizer's vocab size - I think this may not be padded to the model's final logit size. Also the backend_str/vocab_type logic is a little confusing now.
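
For illustration only, here is the kind of mismatch this concern is about (the numbers are hypothetical; whether and how much a given model pads its logits is model-specific):

```python
# Hypothetical sizes: the model's output (logit) dimension is often padded,
# e.g. to a multiple of 64, so it can exceed the tokenizer's raw vocabulary.
tokenizer_vocab_size = 32_000   # what the tokenizer reports
model_logits_size = 32_064      # e.g. hf_config.vocab_size after padding

# A grammar-backed logits processor has to mask a vector with
# model_logits_size entries; sizing it from tokenizer_vocab_size would leave
# the padded tail unaccounted for, or trigger a shape mismatch.
padding = model_logits_size - tokenizer_vocab_size
print(f"{padding} logit slots are not covered by the tokenizer vocabulary")
```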

Comment on lines 171 to 189
- vocab_size=model_config.hf_config.vocab_size,
+ vocab_size=tokenizer.vocab_size,
Member

@aarnphm there is a reason why we needed to reference the model's vocab size and not the tokenizer's, correct?

Contributor Author

Cool, that was a shot in the dark; I figured someone could clarify this. I already have a conflict on this part since #11043 was recently merged, so I will revert this change. But it would be nice to document the difference between these vocab sizes. @aarnphm do you have any reference for that?

@wallashss (Contributor Author) commented Dec 10, 2024

Thanks for checking it out, @mgoin! I really appreciate how fast you responded to this.

> Also the backend_str/vocab_type logic is a little confusing now.

Yeah, but the problem with this part is that it was not correct. I reported a traceback in #11045:

from_huggingface(): incompatible function arguments. The following argument types are supported: (arg0: list[str], arg1: str, arg2: Optional[int], arg3: Optional[list[int]]) -> xgrammar.xgrammar_bindings.TokenizerInfo

First, backend_str is typed as str but was being assigned an enum value (VocabType). Furthermore, from_huggingface does not accept a VocabType, and that's why it crashes with the Mistral tokenizer, and possibly with other kinds of tokenizers as well. So, for these cases we have to call the constructor of TokenizerInfo and pass a VocabType (and no backend_str), which may be RAW for tokenizers like tiktoken and BYTE_FALLBACK for Mistral. I think I can make it more explicit: the first time I wrote a solution for this, I added a bool field on TokenizerData such as is_from_huggingface, but I decided to make it more concise and infer that from the values of these fields (which must be mutually exclusive). Of course, I can add more comments to clarify and add asserts to prevent wrong behavior.
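
To make that branching concrete, here is a minimal sketch of the idea (illustrative only, not the PR's actual code; the xgrammar call signatures are assumptions based on the traceback above and may not match a given xgrammar release):

```python
import xgrammar as xgr


def build_tokenizer_info(encoded_vocab: list[str],
                         backend_str: str | None,
                         vocab_type: xgr.VocabType | None,
                         vocab_size: int | None = None,
                         stop_token_ids: list[int] | None = None):
    # Exactly one of backend_str / vocab_type is expected to be set.
    assert (backend_str is None) != (vocab_type is None)

    if backend_str is not None:
        # Hugging Face-style tokenizers: the argument order mirrors the
        # binding signature reported in the traceback above.
        return xgr.TokenizerInfo.from_huggingface(encoded_vocab, backend_str,
                                                  vocab_size, stop_token_ids)

    # Tokenizers without a usable backend string (Mistral, tiktoken-like):
    # construct TokenizerInfo directly with an explicit VocabType, e.g.
    # xgr.VocabType.RAW or xgr.VocabType.BYTE_FALLBACK.
    return xgr.TokenizerInfo(encoded_vocab,
                             vocab_type=vocab_type,
                             vocab_size=vocab_size,
                             stop_token_ids=stop_token_ids)
```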

What do you think?

@aarnphm (Contributor) commented Dec 10, 2024

We need to reference the tokenizer vocab size because of additional padding tokens.

This is the thread: https://vllm-dev.slack.com/archives/C07QQ8DAXMK/p1732673561777159

Contributor Author

Thanks! I reverted the code.


mergify bot commented Dec 10, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wallashss.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 10, 2024
@mergify mergify bot removed the needs-rebase label Dec 10, 2024
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
@aarnphm (Contributor) left a comment

Just a few final requests from my end; otherwise LGTM.

@@ -115,25 +116,9 @@ async def get_guided_decoding_logits_processor(
def get_local_guided_decoding_logits_processor(
        guided_params: GuidedDecodingParams, tokenizer: PreTrainedTokenizer,
        model_config: ModelConfig) -> LogitsProcessor | None:
    guided_params = maybe_backend_fallback(guided_params)
Contributor

Can you revert this change?

This is used for the offline use case, with LLM, whereas get_guided_decoding_logits_processor is used for the online use case.
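
For context, a minimal sketch of the offline use case being referred to (assuming the GuidedDecodingParams / SamplingParams API available around this PR; the model name is just an example):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Offline (local) path: the LLM entrypoint ends up in
# get_local_guided_decoding_logits_processor rather than the async/online one.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3", tokenizer_mode="mistral")

guided = GuidedDecodingParams(choice=["yes", "no"])
params = SamplingParams(max_tokens=5, guided_decoding=guided)

outputs = llm.generate(
    ["Is Paris the capital of France? Answer yes or no."],
    sampling_params=params)
print(outputs[0].outputs[0].text)
```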

Contributor Author

I reviewed what I did and saw that it was not great, given the differences between the implementations of get_local_outlines_guided_decoding_logits_processor and get_outlines_guided_decoding_logits_processor. But I tried something a little different rather than reverting everything, just to avoid code duplication. See if you agree; if not, I won't insist and can revert with no problem. I also updated the tests to check both the offline and online versions so that they exercise all of these code paths, including the offline path.

Member

I think this is even more confusing now that there are three functions. I would prefer a revert, since it seems you have no other changes to this file. We can consider a refactor in another PR.

Signed-off-by: Wallas Santos <[email protected]>

mergify bot commented Dec 12, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wallashss.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 12, 2024
@mergify mergify bot removed the needs-rebase label Dec 12, 2024
Comment on lines +46 to +55
    # These fields are mutually exclusive: `backend_str` is used to create a
    # TokenizerInfo with `TokenizerInfo.from_huggingface`, while `vocab_type` is
    # used within the constructor of TokenizerInfo.
    backend_str: str | None = None
    vocab_type: xgr.VocabType | None = None

    def __post_init__(self):
        # Check for mutual exclusivity
        assert not (self.backend_str and self.vocab_type), \
            "backend_str and vocab_type are mutually exclusive"
Contributor

Hmm, is it possible to just use either backend_str or vocab_type here?

The cartesian product is probably harder to document.

Contributor Author

> Hmm, is it possible to just use either backend_str or vocab_type here?

The way get_tokenizer_data was implemented, I don't see another way. The from_huggingface method from xgrammar has several paths, and depending on the type of the tokenizer it needs either backend_str or vocab_type.

> The cartesian product is probably harder to document.

I'm sorry, what do you mean by cartesian product? Can you elaborate?

Contributor

There are two mutually exclusive options, which will yield different combinations across different models.

Considering how many models vLLM currently supports, we would have to draw a table of which combinations work and which don't, hence the cartesian product of combinations.

Contributor Author

OK, got it. Do you think documentation is required for this PR? My primary goal was to address the issue with guided decoding when using tokenizer_mode mistral, as mentioned in the related PR. Without this fix, vLLM crashes with xgrammar due to an unhandled code path.

In my opinion, it might be beneficial to apply these fixes and then assess whether extensive documentation or code checks are required to help users understand unsupported features. What do you think?

@mgoin (Member) left a comment

Nice, I think this LGTM once you remove the xgrammar import in the test and revert vllm/model_executor/guided_decoding/__init__.py.

from vllm.model_executor.guided_decoding.outlines_logits_processors import (
    JSONLogitsProcessor, RegexLogitsProcessor)
from vllm.model_executor.guided_decoding.xgrammar_decoding import TokenizerData
Member

Importing from xgrammar_decoding.py will also import and require xgrammar, in addition to your import xgrammar as xgr line below. I would recommend making your test_pickle_xgrammar_tokenizer_data function skip if xgrammar cannot be imported, and simply including these two imports within that function.

In fact, it may be best to add a dedicated xgrammar testing file.
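
A sketch of the suggested pattern (assuming pytest; TokenizerData's fields are taken from the diff discussed above, and any other fields are assumed to have defaults):

```python
import pickle

import pytest


def test_pickle_xgrammar_tokenizer_data():
    # Skip when xgrammar is unavailable, and keep the xgrammar-dependent
    # imports local to the test so the module itself imports cleanly.
    try:
        import xgrammar as xgr
        from vllm.model_executor.guided_decoding.xgrammar_decoding import (
            TokenizerData)
    except ImportError:
        pytest.skip("Could not import xgrammar to run this test")

    data = TokenizerData(vocab_type=xgr.VocabType.RAW)
    roundtripped = pickle.loads(pickle.dumps(data))
    assert roundtripped.vocab_type == xgr.VocabType.RAW
```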

Member

Thanks for the change, but you need to move `from vllm.model_executor.guided_decoding.xgrammar_decoding import TokenizerData` into the try-except as well.

@@ -115,25 +116,9 @@ async def get_guided_decoding_logits_processor(
def get_local_guided_decoding_logits_processor(
        guided_params: GuidedDecodingParams, tokenizer: PreTrainedTokenizer,
        model_config: ModelConfig) -> LogitsProcessor | None:
    guided_params = maybe_backend_fallback(guided_params)
Member

I think this is even more confusing now that there are three functions. I would prefer a revert, since it seems you have no other changes to this file. We can consider a refactor in another PR.

Signed-off-by: Wallas Santos <[email protected]>
@wallashss (Contributor Author)

Thanks for the review @mgoin @aarnphm.

> I think this is even more confusing now that there are three functions. I would prefer a revert, since it seems you have no other changes to this file. We can consider a refactor in another PR.

NP, it made more sense to me when I tried to remove the model_config.

> I would recommend making your test_pickle_xgrammar_tokenizer_data function skip if xgrammar cannot be imported, and simply including these two imports within that function.

Added the conditional skip.

> In fact, it may be best to add a dedicated xgrammar testing file.

Agreed. I added a TODO; I guess it makes sense to do that in another PR.

@mgoin (Member) left a comment

Nice LGTM!

@mgoin mgoin added the ready label Dec 16, 2024
@DarkLight1337 (Member) commented Dec 17, 2024

It appears that AMD CI isn't using the correct version of outlines, or one of its dependencies.

@wallashss (Contributor Author) commented Dec 17, 2024

Hey @DarkLight1337, thanks for checking this.

I did a quick search and I think there is a conflict with the lark library. I could reproduce this import error in my environment if I do a pip install lark-parser (I found similar issues by googling it). I am not sure what is going on in this AMD image, and unfortunately I cannot pull the image to investigate.

@wallashss (Contributor Author) commented Dec 17, 2024

I checked the CI image for AMD and found this:

lark==0.12.0

In my environment I have:

lark==1.2.2

I'm not sure of the best way to fix this besides pinning it in the requirements... Any suggestions?

@DarkLight1337 (Member)

I guess you just have to update the requirements file.
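
For illustration, the kind of pin being discussed would look something like this (the exact requirements file is an assumption; the version comes from the environment reported above):

```text
# e.g. in requirements-common.txt (hypothetical placement)
lark == 1.2.2
```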

Signed-off-by: Wallas Santos <[email protected]>

mergify bot commented Dec 17, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wallashss.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 17, 2024
@mergify mergify bot removed the needs-rebase label Dec 17, 2024
@wallashss (Contributor Author)

@DarkLight1337 updated it. Let's see what CI says now.

@simon-mo simon-mo merged commit 8b79f9e into vllm-project:main Dec 18, 2024
72 of 75 checks passed
SageMoore pushed a commit to neuralmagic/vllm that referenced this pull request Dec 19, 2024
Labels: ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)

Successfully merging this pull request may close these issues.

[Bug]: Guided decoding crashing when tokenizer_mode is set to mistral
5 participants